Structured prediction model learning apparatus, method, program, and recording medium

ABSTRACT

A structured prediction model learning apparatus, method, program, and recording medium maintain prediction performance with a smaller amount of memory. An auxiliary model is introduced by defining the auxiliary model parameter set θ^((k)) with a log-linear model. A set Θ of auxiliary model parameter sets which minimizes the Bregman divergence between the auxiliary model and a reference function indicating the degree of pseudo accuracy is estimated by using unsupervised data. A base-model parameter set λ which minimizes an empirical risk function defined beforehand is estimated by using supervised data and the set Θ of auxiliary model parameter sets.

TECHNICAL FIELD

The present invention relates to methods of machine learning, and more specifically, to an apparatus, method, program, and recording medium for learning a structured prediction model used in a structured prediction system that predicts an output structure with respect to an input structure expressed by a discrete structure (a so-called graph). Machine learning is a technology for learning (extracting) useful regularity, knowledge representations, criteria, and the like from data prepared in advance for learning.

BACKGROUND ART

The problem of predicting a structure hidden behind certain information is called a structured prediction problem. An apparatus (or program) for predicting an output structure with respect to an input structure is called a structured prediction system. The input structure and the output structure are certain discrete structures, and the structures can be expressed by so-called graphs (structures constructed by a set of nodes and a set of edges). The input and output structures can further be expressed by a labeled graph (a graph with labels at nodes and/or edges). A model used in the structured prediction system is called a structured prediction model. The structured prediction model is used to predict the likeliest output structure with respect to the input structure.

Structured prediction problems in the real world include the following, for example: (1) the problem of predicting a grammatical or semantic structure from text data; (2) the problem of predicting a protein structure from genetic sequence data; (3) the problem of predicting (recognizing) an object included in image data; and (4) the problem of predicting a network structure from data expressing person-to-person relations or object-to-object relations.

Some problems in the real world processed on a computer (such as the problems (1) to (4) listed above) can be formulated as structured prediction problems when they are converted into a form that the computer can easily handle. Examples are shown in FIGS. 1 to 3. In the mathematical expressions here, the input structure is denoted by x, and the output structure is denoted by y. The input structure x is one element of a set X of all possible inputs, or x∈X. The output structure y is one element of a set Y of all possible outputs, or y∈Y. Since the output structure y depends on the input structure x, y is one element of a set Y(x) of all possible outputs with respect to x, or y∈Y(x). In addition, Y(x)⊂Y.

FIG. 1 shows a sequence structured prediction problem of extracting a named entity from English-language text. In the figure, a proper noun is given a label indicating the type of the proper noun.

The shown input structure x, “U.N. official John Smith heads for Baghdad on July 4th.”, is segmented into eleven tokens (or words). Six tokens, “U.N.”, “John”, “Smith”, “Baghdad”, “July”, and “4th”, are labeled ORG., PER., PER., LOC., DATE, and DATE, respectively: PER. stands for a person name, LOC. stands for a location name, and ORG. stands for an organization name.

FIG. 2 shows a tree-structure prediction problem for analyzing the dependency structure in English-language text. FIG. 2 shows an example of assigning labels indicating grammatical linking relationships to tokens (or words). The input sequence x, “U.N. official John Smith heads for Baghdad on July 4th.”, is tokenized into eleven units. Each token is given a label indicating a grammatical linking relationship: The label given to “U.N.” has a link from “Smith” (“x1←x4”); the label given to “official” has a link from “Smith” (“x2←x4”); the label given to “John” has a link from “Smith” (“x3←x4”); the label given to “Smith” has a link from “heads” (“x4←x5”); the label given to “heads” has no link since “heads” is the head word of this sentence; the label given to “for” has a link from “heads” (“x5→x6”); the label given to “Baghdad” has a link from “for” (“x6→x7”); the label given to “on” has a link from “Baghdad” (“x7→x8”); the label given to “July” has a link from “on” (“x8→x9”); the label given to “4th” has a link from “July” (“x9→x10”); and the label given to “.” has a link from “heads” (“x11←x5”).

FIG. 3A shows a sequence structured prediction problem of estimating a gene region from a DNA base sequence. Base sequences (codons), each consisting of three of the four bases T, C, A, and G, are given labels representing amino acids: The codon “ATG” is labeled “M”, which stands for the amino acid Methionine; the codon “TGA” is labeled “H”, which stands for the amino acid Histidine; the codons between “ATG” and “TGA” are labeled “R”, “D”, “W”, and “Q”; letters before “ATG” and letters after “TGA” are labeled “O” to indicate that there are no corresponding amino acids. The label “M” indicates the start codon of protein translation, and the label “H” indicates the stop codon of protein translation.

FIG. 3B shows a problem of predicting a network structure from data expressing person-to-person relations or object-to-object relations. In the shown example, the input structure is combinations of a person's name and the person's purchasing history of certain products, and each person is labeled with the name of a different person having the same preference. The shown input structure is: (Smith, (A, B, E)), (Johnson, (F, G, J)), (Williams, (A, C, D)), (Brown, (A, B, C, D, E)), (Jones, (A, C, D)), (Miller, (D, F, G, J)), (Davis, (A, F, G, H, J)). Each node (person's name) is given a label indicating a person having the same preference: Smith is labeled Brown; Johnson is labeled Miller, Davis; Williams is labeled Brown, Jones; Brown is labeled Smith, Williams, Jones; Jones is labeled Williams, Brown; Miller is labeled Johnson, Davis; and Davis is labeled Johnson, Miller.

One choice for predicting a correct output structure with respect to an input structure is to make use of a structured prediction model built by a machine learning method. Methods for learning the structured prediction models that structured prediction systems use are generally classified into three major groups. A first type of learning uses so-called supervised data, which indicates a correct output structure with respect to an input structure. This method is called supervised learning since the data is used as a supervised signal. The supervised signal is an output structure considered to be ideal for a given input structure. Here, the supervised data is given as a set of combinations of an input structure and a supervised signal (ideal output structure). Supervised data having J samples is expressed as

$$D_L = \{(x^{(j)}, y^{(j)})\}_{j=1}^{J}$$

An advantage of supervised learning based on supervised data is that a high-performance structured prediction model can be learned. A difficulty in predicting (estimating) an output structure is that the output structure y has interdependent relations that can be expressed by a labeled graph. Accordingly, the relations in the entire output structure should be considered when the data is created. Expert knowledge about the task is needed in many cases. The cost of creating the large amount of supervised data required to learn a structured prediction model is extremely high in terms of manpower, time, and expense. The performance of supervised learning depends largely on the amount of supervised data. If a sufficient amount of supervised data cannot be prepared, the performance of the structured prediction model obtained by supervised learning with that data will be low.

A second type of learning is unsupervised learning, which uses only data without a known output structure (hereafter, unsupervised data). Unsupervised learning is superior to supervised learning in that there is no need to worry about the cost of creating supervised data. Unsupervised learning, however, requires some type of prior knowledge, such as a hypothesis or a similarity measure between input structures, to provide sufficient prediction performance. If the prior knowledge is not known or is hard to implement on a computer, the structured prediction model obtained from unsupervised learning does not provide sufficient prediction performance. Generally, since it is often hard to implement the prior knowledge on a computer, structured prediction models obtained from unsupervised learning often have lower prediction performance than those obtained from supervised learning.

A third type of learning is semi-supervised learning, which uses both supervised data and unsupervised data. Semi-supervised learning is a method of improving the prediction performance of supervised learning by additionally using unsupervised data when the amount of supervised data is limited. Therefore, semi-supervised learning has the potential to provide a high-performance structured prediction model at low cost.

One known method of learning a structured prediction model by semi-supervised learning is described in J. Suzuki and H. Isozaki, “Semi-Supervised Sequential Labeling and Segmentation Using Giga-word Scale Unlabeled Data”, Proceedings of ACL-08, 2008, pp. 665-673 (hereafter non-patent literature 1). This method is obtained by extending supervised learning of a structured prediction model called a conditional random field (refer to J. Lafferty, A. McCallum, F. Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, Proceedings of the 18th International Conf. on Machine Learning, 2001, pp. 282-289) to semi-supervised learning. A structured prediction system using a structured prediction model learned by this method shows very good prediction performance with real data.

SUMMARY OF THE INVENTION

A limited amount of supervised data can be used in most cases because of the high creation cost. As described earlier, if a sufficient amount of supervised data cannot be used for a structured prediction problem, the structured prediction model obtained from supervised learning does not provide sufficient prediction performance.

In comparison with supervised data, a larger amount of unsupervised data can be obtained more easily. However, it is essentially difficult to obtain sufficient prediction performance from a structured prediction model learned by unsupervised learning, as described earlier.

It is therefore ideal to learn a structured prediction model by semi-supervised learning, which uses a small amount of supervised data and a large amount of unsupervised data.

Since the output structure y is not known in the unsupervised data, when the input structure x is given directly, the conditional probability p(y|x) of the output structure y cannot be estimated in learning of the structured prediction model. Therefore, it has been proposed in Japanese Patent Application Laid Open No. 2008-225907 and non-patent literature 1 to estimate an output structure by using a model generated with a joint probability p(x, y). Generally, in semi-supervised learning, if the supervised data that can be used in learning of the structured prediction model is limited, an enormous amount of unsupervised data is required to obtain sufficient prediction performance. The structured prediction model obtained from the enormous amount of unsupervised data would be complicated. The complicated structured prediction model requires a large storage area in the structured prediction model creating apparatus and the structured prediction system. This can lower the prediction speed of the structured prediction system.

Accordingly, it is an object of the present invention to provide an apparatus, method, program, and recording medium for learning a structured prediction model with a reduced memory space while maintaining prediction performance.

To solve the above-described problems, in the structured prediction model learning technology according to the present invention, a structured prediction model used to predict an output structure y corresponding to an input structure x is learned by using supervised data D_(L) and unsupervised data D_(U). In the structured prediction model learning technology according to the present invention, a supervised-data output candidate graph for the supervised data and an unsupervised-data output candidate graph for the unsupervised data are generated by using a set of definition data for generating output candidates identified by a structured prediction problem; features are extracted from the supervised-data output candidate graph and the unsupervised-data output candidate graph by using a feature extraction template, a D-dimensional base-model feature vector f_(x,y) corresponding to the set of features extracted from the supervised-data output candidate graph is generated, the set of features extracted from the unsupervised-data output candidate graph is divided into K subsets, and a D_(k)-dimensional auxiliary model feature vector g^((k))_(x,y) corresponding to the features included in a subset k of the K subsets is generated, where K is a natural number and k∈{1, 2, . . . , K}; a base-model parameter set λ which includes a first parameter set w formed of D first parameters in one-to-one correspondence with the D elements of the base-model feature vector f_(x,y) is generated, an auxiliary model parameter set θ^((k)) formed of D_(k) auxiliary model parameters in one-to-one correspondence with the D_(k) elements of the auxiliary model feature vector g^((k))_(x,y) is generated, and a set Θ={θ⁽¹⁾, θ⁽²⁾, . . . , θ^((K))} of auxiliary model parameter sets, formed of the K auxiliary model parameter sets θ^((k)), is generated; a set Θ of auxiliary model parameter sets which minimizes the Bregman divergence, with a regularization term obtained from the auxiliary model parameter set θ^((k)), between each auxiliary model q_(k) and a reference function r̃(x, y), which is a nonnegative function indicating the degree of pseudo accuracy of the output structure y corresponding to the input structure x, is estimated by using the regularization term and the unsupervised data D_(U), where the auxiliary model q_(k) is obtained by defining the auxiliary model parameter set θ^((k)) with a log-linear model; and a base-model parameter set λ which minimizes an empirical risk function defined beforehand is estimated by using the supervised data D_(L) and the set Θ of auxiliary model parameter sets, where the base-model parameter set λ includes a second parameter set v={v₁, v₂, . . . , v_(K)} formed of K second parameters in one-to-one correspondence with the K auxiliary models.

EFFECTS OF THE INVENTION

In the present invention, unsupervised data is used to estimate auxiliary model parameters that minimize the Bregman divergence between a reference function indicating the degree of pseudo accuracy and the auxiliary model. With such a configuration, it becomes possible to approximately estimate a correct solution for the unsupervised data, whose correct solution is unknown, and to use that solution. Even with a small amount of supervised data, which is expensive in terms of the cost of generation, unsupervised data, which is inexpensive in terms of the cost of generation, is additionally used to allow a structured prediction model to be learned, and further to allow the prediction performance of the structured prediction model to be improved. In the present invention, by defining auxiliary models with log-linear models or the like, the technique of L₁ norm regularization can be introduced when minimizing the Bregman divergence. By introducing L₁ norm regularization, the number of active parameters (in other words, non-zero parameters) can be reduced. With such a configuration, it is possible to systematically reduce the amount of memory required for the structured prediction model without any special processing.

In addition, with the use of such a structured prediction model, the amount of memory required for the structured prediction system can be reduced. Therefore, the time required to load the structured prediction model from an external storage device such as an HDD to the main memory can be reduced. Further, the index retrieval speed for features is increased, reducing the time for structured prediction.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a sequence structured prediction problem of extracting a named entity from English-language text;

FIG. 2 shows a tree-structure prediction problem of analyzing the dependency structure in English-language text;

FIG. 3A shows a sequence structured prediction problem of estimating a gene region from a DNA base sequence;

FIG. 3B shows a problem of predicting a network structure from data expressing person-to-person relations or object-to-object relations;

FIG. 4 shows the relationship between a structured prediction model learning apparatus 100 and a structured prediction system 7;

FIG. 5 is a functional block diagram of the structured prediction model learning apparatus 100;

FIG. 6 is a flowchart of processing in the structured prediction model learning apparatus 100;

FIG. 7 shows English-language supervised data;

FIG. 8 shows English-language unsupervised data;

FIG. 9 shows a set T₁ of definition data for output candidate generation;

FIG. 10 is an output candidate graph for an English-language input structure;

FIG. 11 shows a feature extraction template T₂;

FIG. 12 shows an example of extracting features from the output candidate graph with respect to the English-language input structure by using the feature extraction template T₂;

FIG. 13 shows an example of a feature vector assigned to a node 411 shown in FIG. 12;

FIG. 14 shows an example of a feature vector assigned to a node 412 shown in FIG. 12;

FIG. 15 shows data examples in a base-model parameter set λ with respect to the English-language input structure;

FIG. 16 shows data examples in a set Θ of auxiliary model parameter sets with respect to the English-language input structure;

FIG. 17 is a functional block diagram of an auxiliary model parameter estimating unit 140;

FIG. 18 is a flowchart of processing in the auxiliary model parameter estimating unit 140;

FIG. 19 is a functional block diagram of a base-model parameter estimating unit 160;

FIG. 20 is a flowchart of processing in the base-model parameter estimating unit 160;

FIG. 21 shows data examples for a parameter set u with respect to the English-language input structure;

FIG. 22 is a block diagram of an example hardware structure of the structured prediction model learning apparatus 100;

FIG. 23 is a graph showing the accuracy rate of the structured prediction system using a structured prediction model learned based on the supervised data only and the accuracy rate of the structured prediction system using a structured prediction model learned by the structured prediction model learning apparatus 100 employing auxiliary models of type 3; and

FIG. 24 shows an example output candidate graph for an English-language linking structure prediction problem.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Outline of Present Invention

A structured prediction model predicts an output structure y with respect to an input structure x. In an example of the present invention, the structured prediction model is defined by the equation given below:

$$\hat{y} = \arg\max_{y \in Y(x)} d(x, y; \lambda, \Theta)$$
$$d(x, y; \lambda, \Theta) = w \cdot f_{x,y} + \sum_{k=1}^{K} v_k\, \theta^{(k)} \cdot g^{(k)}_{x,y} \qquad (1)$$

Equation (1) means that a structure is predicted by using a score corresponding to features extracted from a combination of the input structure x and the output structure y. Equation (1) provides the result of prediction, supposing that an output structure ŷ having the highest total score for the features is the likeliest output structure with respect to the input structure x. Here, d(x, y; λ, Θ) represents a discriminant function that returns a score indicating the likelihood of obtaining the output structure y with respect to the input structure x. The return value of d(x, y; λ, Θ) is a real value. d(x, y; λ, Θ) evaluates a predetermined expression by using a base-model parameter set λ and a set Θ of auxiliary model parameter sets, which will be described later. Here, λ={w, v₁, v₂, . . . , v_(K)} and Θ={θ⁽¹⁾, θ⁽²⁾, . . . , θ^((K))}. f_(x,y) represents a D-dimensional base-model feature vector with respect to the set of features extracted from the supervised data D_(L). w denotes a first parameter set formed of D first parameters in one-to-one correspondence with the D elements of the base-model feature vector f_(x,y). g^((k))_(x,y) denotes a D_(k)-dimensional auxiliary model feature vector corresponding to the set of features included in a subset k obtained when the set of features extracted from the unsupervised data D_(U) is divided into K subsets. θ^((k)) denotes an auxiliary model parameter set including D_(k) auxiliary model parameters in one-to-one correspondence with the D_(k) elements of the feature vector g^((k))_(x,y). v={v₁, v₂, . . . , v_(K)} is a second parameter set formed of K second parameters in one-to-one correspondence with the K auxiliary models. K is a natural number, and k∈{1, 2, . . . , K}. The base-model is used to estimate the base-model parameter set λ used in the discriminant function, by using the supervised data D_(L). The auxiliary model is used to estimate the set Θ of auxiliary model parameter sets used in the discriminant function, by using the unsupervised data D_(U). How to obtain the parameters and vectors will be described later.
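For concreteness, the following is a minimal Python sketch of Equation (1), assuming dense NumPy feature vectors; the names score, predict, and the candidates dictionary are illustrative conveniences, not part of the invention.

```python
import numpy as np

def score(f_xy, g_xy_list, w, v, thetas):
    """Discriminant d(x, y; lambda, Theta) of Equation (1): the base-model
    term w . f_{x,y} plus K weighted auxiliary terms v_k * (theta^(k) . g^(k)_{x,y})."""
    d = float(np.dot(w, f_xy))
    for k, g_xy in enumerate(g_xy_list):
        d += v[k] * float(np.dot(thetas[k], g_xy))
    return d

def predict(candidates, w, v, thetas):
    """The argmax of Equation (1). `candidates` maps each output candidate y
    to its feature vectors (f_{x,y}, [g^(1)_{x,y}, ..., g^(K)_{x,y}])."""
    best_y, best_d = None, float("-inf")
    for y, (f_xy, g_xy_list) in candidates.items():
        d = score(f_xy, g_xy_list, w, v, thetas)
        if d > best_d:
            best_y, best_d = y, d
    return best_y
```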

An Auxiliary Model

If the structured prediction model is learned with unsupervised data, since the output structure y is not known, the structured prediction model cannot be learned by using the output structure y obtained with respect to the input structure x. According to an example of the present invention, a correct output structure y is approximated by using K auxiliary models. The auxiliary model is defined as a nonnegative function. For example, a logistic regression model or a log-linear model can be used as the auxiliary model. The set Θ of auxiliary model parameter sets is estimated to minimize the Bregman divergence between the auxiliary model and a given reference function. To reduce the storage space of the structured prediction model, an L₁ norm regularization term is used. This allows the optimum set Θ of parameter sets to be estimated while the number of non-zero parameters is minimized.

Local Structure

In Equation (1), the total number of output structure candidates Y(x) with respect to the input structure x is generally very large. Since the calculation cost would be too large in many cases, it would be difficult to enumerate all the possible output candidates in Y(x). To reduce the calculation cost, a method of decomposing an output structure y into local structures (sub-structures) z can be used. In this situation, the global features of an entire output structure y cannot be used, and only features obtained from local structures z can be used for prediction.

The local structures must be defined manually, but they can be defined freely in advance in accordance with the target problem. For example, they are generally defined by cliques of the output structure graph. However, it is not necessary to define the local structures by segmenting the elements of the output structure exclusively. The local structures may also be defined in such a manner that one local structure completely includes another local structure.

A set of all local structures that can be obtained from an output structure y in accordance with this predefinition is denoted by Z(x, y). An element of the local structure set Z(x, y) is denoted by z∈Z(x, y). A set of all local structures included in the set Y(x) of all output candidates generated from x is denoted by Z(x, Y(x)).

A feature vector extracted from the information of a given input structure x and a certain local structure z will be denoted by f_(x,z) or g^((k))_(x,z). Suppose that the following equations hold in an example of the present invention.

$$f_{x,y} = \sum_{z \in Z(x,y)} f_{x,z} \qquad (2)$$
$$g^{(k)}_{x,y} = \sum_{z \in Z(x,y)} g^{(k)}_{x,z} \qquad (3)$$

Equation (2) means that the total of the feature vectors f_(x,z) obtained from the local structures z becomes the feature vector f_(x,y) of the entire output structure. Equation (3) means that the total of the feature vectors g^((k))_(x,z) obtained from the local structures z becomes the feature vector g^((k))_(x,y) of the entire output structure. Equation (1) can then be expressed as Equation (1)′ below.

$$\hat{y} = \arg\max_{y \in Y(x)} d(x, y; \lambda, \Theta)$$
$$d(x, y; \lambda, \Theta) = w \cdot f_{x,y} + \sum_{k=1}^{K} v_k\, \theta^{(k)} \cdot g^{(k)}_{x,y} = \sum_{z \in Z(x,y)} w \cdot f_{x,z} + \sum_{k=1}^{K} \sum_{z \in Z(x,y)} v_k\, \theta^{(k)} \cdot g^{(k)}_{x,z} \qquad (1)'$$
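The decomposed form of Equation (1)′ can be computed local structure by local structure. A minimal sketch, with illustrative names, where each local structure z contributes a pair (f_{x,z}, [g^(k)_{x,z}]):

```python
import numpy as np

def local_score(f_xz, g_xz_list, w, v, thetas):
    """Score of one local structure z in Equation (1)':
    w . f_{x,z} + sum_k v_k * (theta^(k) . g^(k)_{x,z})."""
    d = float(np.dot(w, f_xz))
    for k, g_xz in enumerate(g_xz_list):
        d += v[k] * float(np.dot(thetas[k], g_xz))
    return d

def global_score(local_feats, w, v, thetas):
    """Equations (2), (3), and (1)': the global score of y is the sum of
    the local scores over all z in Z(x, y)."""
    return sum(local_score(f, g, w, v, thetas) for f, g in local_feats)
```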

A base-model P and an auxiliary model q_(k) will be defined next, where q_(k) stands for the k-th auxiliary model. Three types of auxiliary models will be described below. Every type of auxiliary model is a nonnegative function, and the auxiliary model parameter set θ^((k)) is defined by a log-linear model.

Auxiliary Model of Type 1

A k-th auxiliary model of type 1 is denoted by q¹_(k). A conditional probability q(y|x) that an output structure y is output with respect to x and the conditional probability q(¬y|x) = 1 − q(y|x) of the opposite are expressed by the following equations.

$$q^1_k(y|x; \theta^{(k)}) = \frac{\exp[\theta^{(k)} \cdot g^{(k)}_{x,y}]}{b(y) + \exp[\theta^{(k)} \cdot g^{(k)}_{x,y}]}, \qquad q^1_k(\neg y|x; \theta^{(k)}) = 1 - q^1_k(y|x; \theta^{(k)}) = \frac{b(y)}{b(y) + \exp[\theta^{(k)} \cdot g^{(k)}_{x,y}]} \qquad (4)$$

Here, b(y) = Σ_(z) b(z) is a function that returns a value greater than or equal to 1. b(y) is assigned such a value that Equation (4) matches a uniform distribution when θ^((k))·g^((k))_(x,y) = 0. Next, q′¹_(k) will be defined by using the odds of q¹_(k).

$$q'^1_k(y|x; \theta^{(k)}) = \mathrm{odds}(q^1_k) = \frac{q^1_k(y|x; \theta^{(k)})}{q^1_k(\neg y|x; \theta^{(k)})} = \frac{\exp[\theta^{(k)} \cdot g^{(k)}_{x,y}]}{b(y)} \propto q^1_k(y|x; \theta^{(k)}) \qquad (5)$$

Therefore, q′¹_(k) takes the value exp[θ^((k))·g^((k))_(x,y)] multiplied by 1/b(y); namely, q′¹_(k) is proportional to q¹_(k).

For the subsequent processing, Q_(k) is defined as follows.

$$Q_k(z|x; \theta^{(k)}) = q^1_k(z|x; \theta^{(k)}) \qquad (6)$$

Here, q¹_(k)(z|x; θ^((k))) represents the conditional probability that the local structure z appears in the output structure y obtained with respect to x. This conditional probability can be calculated as a marginal probability of z from the definition of q¹_(k)(y|x; θ^((k))).
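As an illustration, a small sketch of the type-1 model of Equations (4) and (5), assuming the inner product θ^((k))·g^((k))_(x,y) has already been computed; the function names are hypothetical.

```python
import math

def q1(dot, b_y):
    """Equation (4): q^1_k(y|x) = exp[dot] / (b(y) + exp[dot]), where
    dot = theta^(k) . g^(k)_{x,y}. With dot = 0 this is 1 / (b(y) + 1),
    i.e. uniform when b(y) is chosen as the number of rival candidates."""
    e = math.exp(dot)
    return e / (b_y + e)

def q1_odds(dot, b_y):
    """Equation (5): the odds q'^1_k = q^1_k / (1 - q^1_k) = exp[dot] / b(y)."""
    return math.exp(dot) / b_y
```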

Auxiliary Model of Type 2

An auxiliary model of type 2 is denoted by q²_(k). The auxiliary model q²_(k) uses a simpler structure to reduce the calculation cost. It does not model the output structure y obtained with respect to the input structure x, but the local structures z in the output structure y. A conditional probability q(z|x) of the local structure z in the output structure y, given the input structure x, and the conditional probability q(¬z|x) = 1 − q(z|x) of the opposite are expressed by the equations given below.

$$q^2_k(z|x; \theta^{(k)}) = \frac{\exp[\theta^{(k)} \cdot g^{(k)}_{x,z}]}{b(z) + \exp[\theta^{(k)} \cdot g^{(k)}_{x,z}]}, \qquad q^2_k(\neg z|x; \theta^{(k)}) = 1 - q^2_k(z|x; \theta^{(k)}) = \frac{b(z)}{b(z) + \exp[\theta^{(k)} \cdot g^{(k)}_{x,z}]} \qquad (7)$$

Here, b(z) represents the number of local structures that are rival candidates of the local structure z. This means that b(z) is a correction term such that, at the default value θ^((k))·g^((k))_(x,z) = 0, the probability of appearance of z matches the probability of a rival candidate, as shown below.

$$\frac{1}{b(z) + 1} = \bar{p}(z|x)$$

Next, q′²_(k) is defined as follows, by using the odds of q²_(k).

$$q'^2_k(z|x; \theta^{(k)}) = \mathrm{odds}(q^2_k) = \frac{q^2_k(z|x; \theta^{(k)})}{q^2_k(\neg z|x; \theta^{(k)})} = \frac{\exp[\theta^{(k)} \cdot g^{(k)}_{x,z}]}{b(z)} \propto q^2_k(z|x; \theta^{(k)}) \qquad (8)$$

For the subsequent processing, Q_(k) is defined as follows.

$$Q_k(z|x; \theta^{(k)}) = b(z)\,\mathrm{odds}(q^2_k) = b(z)\, q'^2_k = \exp[\theta^{(k)} \cdot g^{(k)}_{x,z}] \qquad (9)$$

Auxiliary Model of Type 3

An auxiliary model of type 3 is denoted by q³_(k). With q³_(k), the calculation cost of the auxiliary model can be reduced further. The auxiliary model of type 3 models the probability of appearance of each local structure z with a single feature n alone. A conditional probability q(z|x, n) that a local structure z having a feature n appears in the output structure y given the input structure x, and the conditional probability q(¬z|x, n) = 1 − q(z|x, n) of the opposite, are expressed by the following equations.

$$q^3_k(z|x, n; \theta^{(k)}) = \frac{\exp[\theta^{(k)}_n g^{(k)}_{x,z,n}]}{b(z) + \exp[\theta^{(k)}_n g^{(k)}_{x,z,n}]}, \qquad q^3_k(\neg z|x, n; \theta^{(k)}) = 1 - q^3_k(z|x, n; \theta^{(k)}) = \frac{b(z)}{b(z) + \exp[\theta^{(k)}_n g^{(k)}_{x,z,n}]} \qquad (10)$$

Next, q′³_(k) is defined as follows, by using the odds of q³_(k).

$$q'^3_k(z|x, n; \theta^{(k)}) = \mathrm{odds}(q^3_k) = \frac{q^3_k(z|x, n; \theta^{(k)})}{q^3_k(\neg z|x, n; \theta^{(k)})} = \frac{\exp[\theta^{(k)}_n g^{(k)}_{x,z,n}]}{b(z)} \propto q^3_k(z|x, n; \theta^{(k)}) \qquad (11)$$

For the subsequent processing, Q_(k) is defined as follows, where θ^((k)) = (θ^((k))₁, θ^((k))₂, . . . , θ^((k))_(Dk)) and g^((k))_(x,z) = (g^((k))_(x,z,1), g^((k))_(x,z,2), . . . , g^((k))_(x,z,Dk)).

$$Q_k(z|x; \theta^{(k)}) = \prod_n b(z)\,\mathrm{odds}(q^3_k) = \prod_n b(z)\, q'^3_k = \exp[\theta^{(k)} \cdot g^{(k)}_{x,z}] \qquad (12)$$

Here, Q_(k) is the b(z)-scaled odds of q³_(k), that is, q′³_(k), multiplied together over n = 1, 2, . . . , D_(k).
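For types 2 and 3, the b(z)-scaled odds of Equations (9) and (12) reduce to the same exponential form; a one-line sketch with illustrative names:

```python
import math

def Q_k(theta_k, g_xz_k):
    """Q_k(z|x; theta^(k)) of Equations (9) and (12): exp[theta^(k) . g^(k)_{x,z}].
    For type 3 this is the product over the single-feature factors
    exp[theta^(k)_n * g^(k)_{x,z,n}], which collapses to the same dot product."""
    return math.exp(sum(t * g for t, g in zip(theta_k, g_xz_k)))
```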

Base-model

The base-model can be any model as long as the learned structured prediction model can be expressed by Equation (1)′. It is defined by using Q_(k) obtained from the auxiliary models q¹_(k), q²_(k), and q³_(k) in Equations (6), (9), and (12), respectively. Examples of defining the base-model in accordance with the probability model and in accordance with the max-margin model principle will be described next.

Defining the Base-model in Accordance with the Probability Model

The definition of the base-model P in accordance with the probability model is expressed as given below.

$$P(y|x; \lambda, \Theta) = \frac{1}{Z(x, \lambda, \Theta)} \exp[w \cdot f_{x,y}] \prod_k Q_k(y|x; \theta^{(k)})^{v_k} = \frac{1}{Z(x, \lambda, \Theta)} \prod_{z \in Z(x,y)} \exp[w \cdot f_{x,z}] \prod_k Q_k(z|x; \theta^{(k)})^{v_k}$$
$$Z(x, \lambda, \Theta) = \sum_{y \in Y(x)} \exp[w \cdot f_{x,y}] \prod_k Q_k(y|x; \theta^{(k)})^{v_k} \qquad (13)$$

This equation means that the conditional probability P(y|x) that the output structure y is output with respect to the input structure x is defined as the product of the log-linear model and the auxiliary models over the local structures z. No matter which one of the auxiliary models q¹_(k), q²_(k), and q³_(k) is used, the right-hand side of Equation (13) can be reduced as follows.

$$P(y|x; \lambda, \Theta) = \frac{1}{Z'(x, \lambda, \Theta)} \prod_{z \in Z(x,y)} \exp[d(x, z; \lambda, \Theta)]$$
$$Z'(x, \lambda, \Theta) = \sum_{y \in Y(x)} \prod_{z \in Z(x,y)} \exp[d(x, z; \lambda, \Theta)]$$
$$d(x, z; \lambda, \Theta) = w \cdot f_{x,z} + \sum_k v_k \left[ \theta^{(k)} \cdot g^{(k)}_{x,z} \right] \qquad (14)$$

Especially when there is a single auxiliary model (K = 1) and when the base-model feature vector and the auxiliary model feature vector are the same (f_(x,y) = g⁽¹⁾_(x,y)), Equation (14) can be reduced to express d as given below.

$$d(x, z; \lambda, \Theta) = (w + v\theta) \cdot f_{x,z} \qquad (15)$$
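A minimal sketch of Equation (14), reusing global_score from the sketch after Equation (1)′ and normalizing over an explicitly enumerated candidate set; a real implementation would compute Z′ by dynamic programming over the lattice rather than enumeration.

```python
import math

def p_y_given_x(candidates, w, v, thetas):
    """Base-model of Equation (14): P(y|x) proportional to
    prod_{z in Z(x,y)} exp[d(x, z; lambda, Theta)]. `candidates` maps each y
    to the list of (f_{x,z}, [g^(k)_{x,z}]) pairs of its local structures."""
    log_scores = {y: global_score(feats, w, v, thetas)
                  for y, feats in candidates.items()}
    m = max(log_scores.values())       # subtract the max for numerical stability
    z_prime = sum(math.exp(s - m) for s in log_scores.values())
    return {y: math.exp(s - m) / z_prime for y, s in log_scores.items()}
```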

Defining the Base-model in Accordance with the Max-margin Model

The definition of the base-model P in accordance with the maximization of the margin in the linear identification model is expressed as given below.

$$P(x, y; \lambda, \Theta) = \max\left[0,\; E(y, \hat{y}) - d(x, y; \lambda, \Theta) + d(x, \hat{y}; \lambda, \Theta)\right]$$
$$\hat{y} = \arg\max_{y' \in Y(x) \setminus y} \left[ d(x, y'; \lambda, \Theta) + E(y, y') \right] \qquad (16)$$

Here, E(y, ŷ) is a function expressing the degree of error of ŷ, obtained by comparing a correct output y with a certain output ŷ. The value of E(y, ŷ) increases as the error increases, that is, as the difference between y and ŷ increases. A\B represents the difference set obtained by subtracting the set B from the set A. Equation (16) means that the difference between the score d(x, y; λ, Θ) of the correct output structure y with respect to the input structure x and the score d(x, ŷ; λ, Θ) of the incorrect output structure ŷ with the highest risk should become the estimated error E(y, ŷ) or greater.
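Equation (16) corresponds to a structured hinge loss with loss-augmented inference. A brute-force sketch over an enumerated candidate set, again reusing global_score; E and the helper names are illustrative.

```python
def margin_loss(candidates, gold_y, E, w, v, thetas):
    """Equation (16): max[0, E(y, y_hat) - d(x, y) + d(x, y_hat)], where y_hat
    is the loss-augmented argmax over Y(x) \\ {y}. E might be, e.g., the
    Hamming distance between label sequences."""
    d_gold = global_score(candidates[gold_y], w, v, thetas)
    best = float("-inf")
    for y, feats in candidates.items():
        if y == gold_y:
            continue                    # argmax over Y(x) \ {gold_y}
        best = max(best, global_score(feats, w, v, thetas) + E(gold_y, y))
    return max(0.0, best - d_gold)
```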

Embodiments of the present invention will be described below in detail with these definitions.

First Embodiment

In a learning phase, a structured prediction model learning apparatus 100 learns a structured prediction model by using supervised data D_(L), unsupervised data D_(U), and information stored in a learning support information memory 4, and outputs the learned structured prediction model to a structured prediction model memory 6, as shown in FIG. 4. The structured prediction model memory 6 stores the learned structured prediction model. A structured prediction system 7 receives the structured prediction model from the structured prediction model memory 6. The structured prediction system 7 further receives an unlabeled sample S_(U) (input structure x), estimates a labeled sample S_(L) (output structure y) corresponding to the unlabeled sample by using the structured prediction model, and outputs the labeled sample.

The supervised data D_(L) is a set of combinations of an input structure x and a supervised signal (ideal output structure y). Supervised data of N samples is expressed as

$$D_L = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$$

The unsupervised data D_(U) is a set of data of input structures x alone, and their correct output structures y are not known. Unsupervised data of M samples is expressed as

$$D_U = \{x^{(m)}\}_{m=1}^{M}$$

For named entity extraction as shown in FIG. 1, for example, approximately N = 10,000 samples and M = 10,000,000 samples or greater are required to learn a structured prediction model.
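In code, the two data sets might look as follows; this is a toy illustration, and the tokens and labels are made up except for the FIG. 1 sentence.

```python
# Supervised data D_L: pairs of an input token sequence and its ideal labels.
D_L = [
    (["U.N.", "official", "John", "Smith", "heads", "for",
      "Baghdad", "on", "July", "4th", "."],
     ["ORG.", "O", "PER.", "PER.", "O", "O",
      "LOC.", "O", "DATE", "DATE", "O"]),
]

# Unsupervised data D_U: input token sequences only, with no known labels.
D_U = [
    ["The", "chairman", "visited", "Paris", "in", "May", "."],
]
```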

The learning support information memory 4 stores a feature extraction template T₂ and a set T₁ of definition data for output candidate generation, which will be described later, as learning support information.

Structured Prediction Model Learning Apparatus 100

The structured prediction model learning apparatus 100 in the first embodiment will be described with reference to FIGS. 5 and 6. The structured prediction model learning apparatus 100 is formed, for example, of a central processing unit (CPU), a random access memory (RAM), a read-only memory (ROM), a hard disk drive (HDD), and an input-output interface.

The structured prediction model learning apparatus 100 includes, for example, a memory 103, a controller 105, an output candidate graph generator 110, a feature vector generator 120, a parameter generator 130, an auxiliary model parameter estimating unit 140, a base-model parameter estimating unit 160, a first convergence determination unit 180, and a parameter integrating unit 190. The output candidate graph generator 110 and the feature vector generator 120 are provided to perform preprocessing for learning.

Memory 103 and Controller 105

The memory 103 is formed of the RAM, ROM, HDD, and the like. The memory 103 stores the supervised data D_(L), the unsupervised data D_(U), the learning support information, signals and parameters in the middle of processing, and so on. The controller 105 is formed of the CPU and the like. The controller 105 reads and writes signals or parameters from or into the memory 103 during processing. The controller 105 does not necessarily need to read data from or write data into the memory 103 and may control the individual units to exchange data directly.

Examples of Input Data

Described below is an example of learning a structured prediction model used to predict an output structure to which a label representing a named entity is given, from an input structure formed of English-language text data. FIGS. 7 and 8 show examples of information input to the structured prediction model learning apparatus 100 shown in FIG. 5. FIG. 7 shows English-language supervised data; FIG. 8 shows English-language unsupervised data. The example shown in FIG. 7 is identical to the example shown in FIG. 1. Division into tokens is carried out in advance.

FIG. 9 shows the set T₁ of definition data for output candidate generation in this embodiment. The shown set of definition data for output candidate generation consists of five predetermined definition data items for output candidate generation. The set T₁ of definition data for output candidate generation is determined automatically by the target structure prediction problem. The structured prediction model learning apparatus 100 obtains the set T₁ of definition data for output candidate generation from the learning support information memory 4.

Output Candidate Graph Generator 110

The output candidate graph generator 110 receives the supervised data D_(L), the unsupervised data D_(U), and the set T₁ of definition data for output candidate generation. The output candidate graph generator 110 generates a supervised-data output candidate graph Gr_(DL) corresponding to the received supervised data D_(L) by using the set T₁ of definition data for output candidate generation identified by the structure prediction problem (s110). The output candidate graph generator 110 also generates an unsupervised-data output candidate graph Gr_(DU) corresponding to the received unsupervised data D_(U) by using the set T₁ of definition data for output candidate generation (s110). The output candidate graph generator 110 associates the received supervised data D_(L) with the supervised-data output candidate graph Gr_(DL) generated from it, and associates the received unsupervised data D_(U) with the unsupervised-data output candidate graph Gr_(DU) generated from it. The output candidate graph generator 110 then outputs these data items to the feature vector generator 120. The output candidate graph is expressed as a lattice of all possible output structure candidates connected by paths, as shown in FIG. 10. In the example shown in FIG. 10, the set of definition data for output candidate generation consists of three predetermined definition data items for output candidate generation: PER., ORG., and O. FIG. 10 shows an example of an output candidate graph generated by the structured prediction model learning apparatus 100 shown in FIG. 5. In the graph, <BOS> is a special fixed label representing the beginning of an input structure x, and <EOS> is a special fixed label representing the end of the input structure x. The lattice represents output structures y corresponding to an input structure x (supervised data D_(L) or unsupervised data D_(U)); each node represents an instance y^(e) (e = 1, 2, 3) of the output structures y; and each link represents a dependency between instances. A single path between <BOS> and <EOS> in the output candidate graph corresponds to a single output, and the output candidate graph includes all possible output candidates. For example, the output candidate graph in FIG. 10 includes 3⁸ different paths (output candidates). Node 401 in FIG. 10 indicates an output instance where the fourth word “SD” in the input structure x is labeled “ORG.”; node 402 in FIG. 10 indicates an output instance where the sixth word “two” in the input structure x is labeled “O”.
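As a concrete, if naive, illustration, the sketch below enumerates the paths such a lattice encodes. A real system scores the lattice with dynamic programming rather than materializing every path, and all names are illustrative.

```python
from itertools import product

def output_candidates(tokens, labels=("PER.", "ORG.", "O")):
    """Enumerate every path of the output candidate graph between <BOS> and
    <EOS>: one label per token, |labels| ** len(tokens) candidates in total."""
    return product(labels, repeat=len(tokens))

# For an 8-token input there are 3 ** 8 = 6561 paths:
assert sum(1 for _ in output_candidates(["w"] * 8)) == 6561
```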

Feature Vector Generator 120

The feature vector generator 120 receives a feature extraction template T₂, the supervised-data output candidate graph Gr_(DL), and the unsupervised-data output candidate graph Gr_(DU). The feature vector generator 120 extracts features from the supervised-data output candidate graph Gr_(DL) and the unsupervised-data output candidate graph Gr_(DU) by using the received feature extraction template T₂ (s120). The feature vector generator 120 generates a feature vector f_(x,y) for the D-dimensional base-model, corresponding to the set of features extracted from the supervised-data output candidate graph Gr_(DL) (s120). The feature vector generator 120 divides the set of features extracted from the unsupervised-data output candidate graph Gr_(DU) into K subsets. The feature vector generator 120 generates a feature vector g^((k))_(x,y) for a D_(k)-dimensional auxiliary model, corresponding to the features included in a subset k of the K subsets (s120). The feature vector generator 120 assigns the feature vectors f_(x,y) for the base-model to the supervised-data output candidate graph Gr_(DL) and outputs them to the parameter generator 130. The feature vector generator 120 assigns the feature vectors g^((k))_(x,y) for the auxiliary model to the unsupervised-data output candidate graph Gr_(DU) and outputs them to the parameter generator 130.

How the feature vector generator 120 extracts a feature from an output candidate graph will be described below. The feature vector generator 120 extracts a feature from the output candidate graph in accordance with a combination of a label y_(i) and an instance in the input structure described in the feature extraction template, where the label y_(i) is the i-th label of an output structure. FIG. 11 shows an example of the feature extraction template T₂. Using the feature extraction template T₂, the feature vector generator 120 extracts, as features, combinations (y_(i) & x_(i−2), y_(i) & x_(i−1), y_(i) & x_(i), y_(i) & x_(i+1), and y_(i) & x_(i+2)) of the label y_(i) and each of up to two input words x_(i−2), x_(i−1), x_(i), x_(i+1), and x_(i+2) before and after the label, a combination (y_(i) & x_(i+1) & x_(i+2)) of the label y_(i) and the two input words x_(i+1) and x_(i+2) after the label, and the like. FIG. 12 shows an example in which the feature vector generator 120 extracts features from the output candidate graph by using the feature extraction template T₂.

In FIG. 12, the first label (y₁) of the output structures is “PER.” at node 411. The feature vector generator 120 extracts the features 411A indicated in FIG. 12. The third label (y₃) of the output structures is “ORG.” at node 412. The feature vector generator 120 extracts the features 412A indicated in FIG. 12. The fifth label (y₅) of the output structures is “ORG.” at node 413. The feature vector generator 120 extracts the features 413A indicated in FIG. 12.
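A sketch of this window-based extraction follows; it is a guess at the shape of the template of FIG. 11, and the padding labels and feature-string format are illustrative.

```python
def extract_features(tokens, i, label):
    """Features for a node: the label y_i combined with each word in a
    two-word window around position i, plus the bigram of the two words
    after position i, following the style of the template T2."""
    pad = ["<BOS>", "<BOS>"] + tokens + ["<EOS>", "<EOS>"]
    j = i + 2                          # index of token i in the padded list
    feats = [f"{label}&x[{o:+d}]={pad[j + o]}" for o in (-2, -1, 0, 1, 2)]
    feats.append(f"{label}&x[+1]={pad[j + 1]}&x[+2]={pad[j + 2]}")
    return feats
```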

How to generate and assign a feature vector will be described next. The feature vector generator 120 collects the features extracted from all nodes of all supervised-data output candidate graphs Gr_(DL) obtained from all the supervised data D_(L), eliminates identical features, and generates a supervised-data feature set. The number of elements included in the supervised-data feature set should be D.

The feature vector generator 120 also collects the features extracted from all nodes of the unsupervised-data output candidate graphs Gr_(DU) obtained from all the unsupervised data D_(U), eliminates identical features, and generates an unsupervised-data feature set. The feature vector generator 120 divides the unsupervised-data feature set into K subsets. It is preferable to divide the unsupervised-data feature set in accordance with feature types. The feature types may be the medium type (newspaper, web, etc.), content type (economy, sports, etc.), author, and the like, of the unsupervised data. The number of elements included in subset k should be D_(k). Since different feature types may have different distributions, this configuration can improve the prediction performance.

The feature vector generator 120 assigns a feature vector to each node (or each link) of the output candidate graph. The base-model feature vector f_(x,y) is a D-dimensional vector consisting of elements in one-to-one correspondence with the elements of the feature set extracted from the supervised-data output candidate graph Gr_(DL). The auxiliary model feature vector g^((k))_(x,y) is a D_(k)-dimensional vector consisting of elements in one-to-one correspondence with the elements of a subset of the feature set extracted from the unsupervised-data output candidate graph Gr_(DU). No matter whether the source is the supervised-data output candidate graph Gr_(DL) or the unsupervised-data output candidate graph Gr_(DU), the feature vector is assigned in the same way. FIG. 13 shows the feature vector assigned to node 411 in FIG. 12. FIG. 14 shows the feature vector assigned to node 412 in FIG. 12. The feature vector generator 120 gives a value “1” to a feature extracted from the node and a value “0” to a feature that cannot be extracted from the node, so a feature vector having elements “1” or “0” is generated. The feature vector generator 120 assigns the generated feature vector to the corresponding node. The feature vector generator 120 generates features in accordance with a combination of each label and each of up to two input words before and after the label, and the like. Therefore, the feature vectors of nodes corresponding to the same i-th word of the input structure but having different i-th labels in the output structure are orthogonal vectors; their inner product is 0.
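A sketch of the 0/1 vector assignment; the index mapping is an implementation detail, not prescribed by the embodiment.

```python
def feature_vector(node_feats, feature_index):
    """Binary feature vector for one node: element d is 1 iff the feature
    with index d was extracted at the node. `feature_index` maps each
    feature string to a dimension in [0, D)."""
    vec = [0] * len(feature_index)
    for f in node_feats:
        if f in feature_index:
            vec[feature_index[f]] = 1
    return vec
```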

Parameter Generator 130

The parameter generator 130 receives the supervised-data output candidate graph Gr_(DL) having the base-model feature vectors f_(x,y) assigned thereto and the unsupervised-data output candidate graph Gr_(DU) having the auxiliary model feature vectors g^((k))_(x,y) assigned thereto. The parameter generator 130 generates a base-model parameter set λ that includes a first parameter set w = {w₁, w₂, . . . , w_(D)} consisting of D first parameters in one-to-one correspondence with the D elements of the base-model feature vector f_(x,y) (s130) and outputs the set to the base-model parameter estimating unit 160.

The parameter generator 130 also generates an auxiliary model parameter set θ^((k)) = {θ^((k))₁, θ^((k))₂, . . . , θ^((k))_(Dk)} consisting of D_(k) auxiliary model parameters in one-to-one correspondence with the D_(k) elements of the auxiliary model feature vector g^((k))_(x,y). The parameter generator 130 further generates a set Θ = {θ⁽¹⁾, θ⁽²⁾, . . . , θ^((K))} of auxiliary model parameter sets consisting of the K auxiliary model parameter sets θ^((k)) (s130), and outputs the set to the auxiliary model parameter estimating unit 140.

The parameter generator 130 specifies “0” as the initial value of each parameter, for example. FIG. 15 shows data examples in the base-model parameter set λ. FIG. 16 shows data examples in the set Θ of auxiliary model parameter sets.

The parameter generator 130 may also generate a parameter t = 0, indicating the number of estimation repetitions of the auxiliary model parameter estimating unit 140 and the base-model parameter estimating unit 160.

Auxiliary Model Parameter Estimating Unit 140

The auxiliary model parameter estimating unit 140 obtains a regularization term from the auxiliary model parameter set θ^((k)). The auxiliary model parameter estimating unit 140 further estimates a set Θ of auxiliary model parameter sets that minimizes the Bregman divergence, with the regularization term, between a reference function r̃(x, y) and the auxiliary model q_(k), by using the unsupervised data D_(U) (s140).

The auxiliary model parameter estimating unit 140, for example, receives the reference function r̃(x, y), the set Θ of auxiliary model parameter sets, and the unsupervised-data output candidate graph Gr_(DU) having the auxiliary model feature vectors g^((k))_(x,y) assigned thereto.

The auxiliary model parameter estimating unit 140 estimates the set Θ of auxiliary model parameter sets that minimizes the Bregman divergence between the reference function r̃ and the auxiliary model q_(k). To minimize the Bregman divergence between the reference function and the auxiliary model means to obtain a set Θ of auxiliary model parameter sets that brings the auxiliary model closest to the reference function in the solution space. When the Bregman divergence is minimized, an L₁ regularization term is included. This enables the memory space needed to store the learned structured prediction model to be reduced. If each auxiliary model q_(k) defined the auxiliary model parameter set θ^((k)) by a probability model, the probabilities would have to sum to 1, and the L₁ regularization term could not be included. In this embodiment, each auxiliary model q_(k) defines the auxiliary model parameter set θ^((k)) by a log-linear model. Since the log-linear model does not have the restriction given above, the L₁ regularization term can be included.

Reference Function

A reference function is defined first. The reference function r̃(x, y) is a nonnegative function; its value range is [0, ∞). When the auxiliary models q¹_(k), q²_(k), and q³_(k) above are used, the value range of the reference function is [0, 1], because the value range of these auxiliary models is also [0, 1]. The reference function r̃(x, y) represents the degree of pseudo accuracy of the output structure y with respect to the input structure x. For example, when the auxiliary model parameter estimating unit 140 estimates the auxiliary model parameters for the first time, supervised data D_(L) should be used beforehand to estimate a first parameter set w (Japanese Patent Application Laid Open No. 2008-225907), and a base-model obtained by defining the estimated first parameter set w with a log-linear model is used as the reference function (there is no auxiliary model, and the elements of the second parameter set v are set to 0, for example). In this case, r̃(x, z) = P(z|x, w*).

When the auxiliary model parameter estimating unit 140 estimates the auxiliary model parameters for the second or subsequent time, the base-model P(z|x, λ^(t−1), Θ^(t−1)) obtained in the preceding iteration of the repeated calculation is used as the reference function. A function predetermined by a person, or a completely different model (such as the language analysis model mentioned in Japanese Patent Application Laid Open No. 2008-225907), can also be used as the reference function.

Bregman Divergence

The Bregman divergence B_(F) between the reference function r̃ and the auxiliary model q_(k) is defined as given below.

$$B_F(\tilde{r} \,\|\, q_k) = F(\tilde{r}) - F(q_k) - \nabla F(q_k) \cdot (\tilde{r} - q_k) \qquad (21)$$

Here, F is a continuously differentiable, real-valued, and strictly convex function. For example, an L₂ norm can be used as F. In this embodiment, F(x) = Σx log x − Σx. Then, the Bregman divergence B_(F) becomes identical to a generalized relative entropy G, as indicated by Equation (22) below.

$$B_F(\tilde{r} \,\|\, q_k)\Big|_{F(x) = \sum x \log x - \sum x} = G(\tilde{r} \,\|\, q_k) = \sum \tilde{r} \log \tilde{r} - \sum \tilde{r} \log q_k - \sum \tilde{r} + \sum q_k \qquad (22)$$
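A direct transcription of Equation (22) for two nonnegative functions sampled at the same points; neither needs to sum to 1, and the eps guard is an implementation convenience rather than part of the definition.

```python
import math

def gen_rel_entropy(r, q, eps=1e-12):
    """Generalized relative entropy G(r || q) of Equation (22):
    sum_i [ r_i * log(r_i / q_i) - r_i + q_i ]."""
    return sum(ri * math.log((ri + eps) / (qi + eps)) - ri + qi
               for ri, qi in zip(r, q))
```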

In the end, the estimation of the set Θ of auxiliary model parameter sets means minimizing the generalized relative entropy G, with an L₁ norm regularization term, between the reference function r̃ and the auxiliary model q_(k). To obtain the exact generalized relative entropy, all possible input and output pairs (x, y) are required. However, it is impossible to list all possible input and output pairs. Therefore, in practice, the unsupervised data D_(U) is used instead of all possible input and output pairs. A generalized relative entropy obtained by using a limited amount of observed data is called an empirical generalized relative entropy and is denoted by Ĝ_(DU). The equation for obtaining the optimum set Θ of auxiliary model parameter sets (the equation for minimizing the empirical generalized relative entropy U(Θ|D_(U)) with a regularization term) is expressed as follows.

$$\Theta^{*} = \arg\min_{\Theta} U(\Theta|D_U), \qquad U(\Theta|D_U) = C_U \sum_k |\theta^{(k)}|_1 + \sum_k \hat{G}_{D_U}(\tilde{r} \,\|\, q_k) \qquad (23)$$

Here, |θ^((k))|₁ denotes the L₁ norm of the k-th auxiliary model parameter set θ^((k)). C_(U) is a variable for adjusting the relative importance of the first and second terms on the right-hand side. This means that C_(U) determines whether the empirical generalized relative entropy or the L₁ regularization term is regarded as more important. C_(U) is a hyperparameter to be adjusted manually.

The empirical generalized relative entropy U(Θ|D_(U)) with a regularization term obtained by using the auxiliary model q¹_(k), q²_(k), q³_(k), q′¹_(k), q′²_(k), or q′³_(k) will be described next.

Using q¹

When q¹_(k) is used as the auxiliary model, the empirical generalized relative entropy U(Θ|D_(U)) with a regularization term is expressed as follows, by using Equations (22), (23), and (4).

$$U(\Theta|D_U) = C_U \sum_k |\theta^{(k)}|_1 + \sum_k \hat{G}_{D_U}\!\left( \tilde{r}(x,y) \,\big\|\, q^1_k(y|x; \theta^{(k)}) \right)$$
$$= C_U \sum_k |\theta^{(k)}|_1 - \sum_k \sum_{x \in D_U} \sum_y \tilde{r}(x,y) \left[ \theta^{(k)} \cdot g^{(k)}_{x,y} \right] + \sum_k \sum_{x \in D_U} \sum_y \log\!\left[ b(y) + \exp[\theta^{(k)} \cdot g^{(k)}_{x,y}] \right] + \mathrm{const}(\theta^{(k)}) \qquad (24)$$

Here, const(θ^((k))) is a collective value of the terms constant with respect to θ^((k)). In optimization (when U(Θ|D_(U)) is minimized), const(θ^((k))) does not affect the solution. The gradient of Equation (24) is expressed as follows.

$$\nabla_{k} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta^{(k)}\right) - \sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[g_{x,y}^{(k)}\right] + \sum_{x \in D_U}\sum_{y} q_k^{1}(y \mid x;\theta^{(k)})\left[g_{x,y}^{(k)}\right] \qquad (25)$$

Here, σ(a) is a function that, given a vector a, returns a vector in which each element of a is replaced by −1, 0, or 1: an element greater than 0 is replaced by 1, an element smaller than 0 by −1, and an element equal to 0 by 0. Equation (23) attains its optimum (the minimum of U(Θ|D_U)) when ∇_k U(Θ|D_U) = 0 for all values of k. In practice, the optimum can be obtained by a gradient-based optimization method.
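As a concrete illustration (not part of the specification), the following NumPy sketch shows the elementwise sign function σ(a) and the overall shape of the gradient in Equation (25); the dense feature tensor, the reference scores, and all names are hypothetical stand-ins for the structured quantities.

```python
import numpy as np

def sigma(a):
    # Elementwise sign: +1 for positive, -1 for negative, 0 for zero,
    # exactly as sigma(a) is defined for the L1 regularization term.
    return np.sign(a)

def grad_U_k(theta_k, g, r_tilde, q1, C_U):
    """Gradient of U with respect to theta^(k), shaped like Eq. (25).

    g:       (num_x, num_y, dim) feature vectors g^(k)_{x,y}
    r_tilde: (num_x, num_y) reference scores r~(x, y)
    q1:      (num_x, num_y) auxiliary model values q1_k(y|x; theta^(k))
    """
    ref_term = np.einsum('xy,xyd->d', r_tilde, g)   # reference expectation
    model_term = np.einsum('xy,xyd->d', q1, g)      # model expectation
    return C_U * sigma(theta_k) - ref_term + model_term
```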

Using q′¹

When q′¹_k is used, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is expressed as follows, by using Equations (22), (23), and (5).

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,y)\,\middle\|\,q_k^{\prime 1}(y \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{y} q_k^{\prime 1}(y \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned} \qquad (26)$$

The gradient of Equation (26) is expressed as follows.

$$\nabla_{k} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta^{(k)}\right) - \sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[g_{x,y}^{(k)}\right] + \sum_{x \in D_U}\sum_{y} q_k^{\prime 1}(y \mid x;\theta^{(k)})\left[g_{x,y}^{(k)}\right] \qquad (27)$$

Using q²

When q²_k is used as an auxiliary model, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is expressed as follows, by using Equations (22), (23), and (7).

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned} \qquad (28)$$

The gradient of Equation (28) is expressed as follows.

$$\nabla_{k} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta^{(k)}\right) - \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[g_{x,z}^{(k)}\right] + \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{2}(z \mid x;\theta^{(k)})\left[g_{x,z}^{(k)}\right] \qquad (29)$$

Using q′²

When q′²_k is used, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is expressed as follows, by using Equations (22), (23), and (8).

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 2}(z \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned} \qquad (30)$$

The gradient of Equation (30) is expressed as follows.

$$\nabla_{k} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta^{(k)}\right) - \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[g_{x,z}^{(k)}\right] + \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 2}(z \mid x;\theta^{(k)})\left[g_{x,z}^{(k)}\right] \qquad (31)$$

If an auxiliary model of type 2 is used, the whole output structure y is not calculated; only the local structures z are calculated. With this configuration, type 2 is expected to provide higher speed than type 1. As with an auxiliary model of type 1, when an auxiliary model of type 2 is used, the optimum value can be obtained by a gradient-based optimization method.

Using q³

When q³_k is used as an auxiliary model, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is expressed as follows, by using Equations (22), (23), and (10).

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned} \qquad (32)$$

The gradient of Equation (32) is expressed as follows.

$$\nabla_{k,n} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta_n^{(k)}\right) - \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[g_{x,z,n}^{(k)}\right] + \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{3}(z \mid x, n;\theta_n^{(k)})\left[g_{x,z,n}^{(k)}\right] \qquad (33)$$

Using q′³

When q′³_k is used, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is expressed as follows, by using Equations (22), (23), and (11).

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 3}(z \mid x, n;\theta_n^{(k)}) + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned} \qquad (34)$$

The gradient of Equation (34) is expressed as follows.

$$\nabla_{k,n} U(\Theta \mid D_U) = C_U\,\sigma\!\left(\theta_n^{(k)}\right) - \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[g_{x,z,n}^{(k)}\right] + \sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 3}(z \mid x, n;\theta_n^{(k)})\left[g_{x,z,n}^{(k)}\right] \qquad (35)$$

When an auxiliary model of type 3 is used, the need to consider dependencies between parameters is eliminated; the solution can be obtained by a one-dimensional search over each single variable. This greatly simplifies the calculation of the gradient, which occupies most of the computation in the numerical optimization. Accordingly, an auxiliary model of type 3 is very suitable in terms of the computational cost of learning.
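To illustrate why the type-3 case decomposes, the following hypothetical sketch solves for each scalar parameter θ^((k))_n independently by bisection on its one-dimensional gradient; grad_n stands in for Equation (33) or (35) restricted to a single n, and the search interval is an arbitrary assumption. The bisection is justified by the per-parameter convexity noted in the next paragraph.

```python
def solve_coordinate(grad_n, lo=-10.0, hi=10.0, tol=1e-8):
    """1-D search for a single parameter theta_n^(k).

    grad_n: scalar function returning the gradient of U with respect to
    theta_n^(k) alone. Because U is convex in each parameter, this
    gradient is nondecreasing, so bisection on its sign converges to
    the minimizer."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad_n(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Each theta_n^(k) can then be solved independently, e.g.:
# theta[k][n] = solve_coordinate(lambda t: grad_eq33(k, n, t))
```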

No matter whether an auxiliary model of type 1, 2, or 3 is used, the empirical generalized relative entropy U(Θ|D_U) with a regularization term is a convex function of each parameter. Therefore, a single optimum solution is determined.

Configuration of the Auxiliary Model Parameter Estimating Unit 140 and Processing Flow

The auxiliary model parameter estimating unit 140 will be described with reference to FIGS. 17 and 18. To estimate the set Θ of auxiliary model parameter sets in accordance with a conditional random field, the auxiliary model parameter estimating unit 140 includes an empirical-generalized-relative-entropy-with-regularization-term calculator 145, a gradient calculator 147, a second convergence determination unit 149, and a parameter updating unit 151. The conditional random field is described in detail in F. Sha and F. Pereira, "Shallow Parsing with Conditional Random Fields", Proceedings of HLT/NAACL-2003, pages 134-141, 2003 (hereafter reference literature 1), for example, and a description thereof will be omitted here.

The empirical-generalized-relative-entropy-with-regularization-term calculator 145 receives the unsupervised data D_U, the reference function r̃(x, y), and the set Θ of auxiliary model parameter sets, and calculates an empirical generalized relative entropy U(Θ|D_U) with a regularization term in Equation (24), (26), (28), (30), (32), or (34) (s145). The empirical-generalized-relative-entropy-with-regularization-term calculator 145 outputs the empirical generalized relative entropy U(Θ|D_U) with a regularization term to the gradient calculator 147.

To optimize (minimize) the empirical generalized relative entropy U(Θ|D_U) with a regularization term, a gradient-based numerical optimization method such as L-BFGS can be used. L-BFGS is described in D. C. Liu and J. Nocedal, "On the Limited Memory BFGS Method for Large Scale Optimization", Math. Programming, Ser. B, 1989, Volume 45, Issue 3, pp. 503-528 (hereafter reference literature 2), for example, and a description thereof will be omitted here.
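As a rough sketch only, and not the specification's implementation, the minimization could be driven by an off-the-shelf L-BFGS routine as follows; objective_U and gradient_U are hypothetical wrappers around Equations (24) and (25), with the L₁ term handled through the sign-based subgradient σ.

```python
from scipy.optimize import minimize

def fit_auxiliary_params(theta0, objective_U, gradient_U):
    # Minimize the empirical generalized relative entropy with a
    # regularization term by L-BFGS (see reference literature 2).
    # theta0 is a flattened initial vector covering all theta^(k).
    result = minimize(objective_U, theta0, jac=gradient_U,
                      method='L-BFGS-B')
    return result.x  # estimated set of auxiliary model parameters
```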

The gradient calculator 147 calculates the gradient of U(Θ|D_U). The gradients of Equations (24), (26), (28), (30), (32), and (34) are expressed by Equations (25), (27), (29), (31), (33), and (35), respectively.

The second convergence determination unit 149 determines whether the gradient ∇U(Θ|D_U) expressed by Equation (25), (27), (29), (31), (33), or (35) has converged (s149). When it is determined that the value of the gradient ∇U(Θ|D_U) has converged, the second convergence determination unit 149 outputs the set Θ* of auxiliary model parameter sets at that time to the first convergence determination unit 180 and the base-model parameter estimating unit 160. If it is not determined that the value of the gradient ∇U(Θ|D_U) has converged, the parameter updating unit 151 updates the set Θ of auxiliary model parameter sets (s151).

Base-Model Parameter Estimating Unit 160

The base-model parameter estimating unit 160 estimates a base-model parameter set λ that minimizes a predefined empirical risk function, by using the supervised data D_L and the set Θ of auxiliary model parameter sets (s160).

The risk function and regularization term can be defined in many ways.For example, they are defined as follows.

$$\lambda^{*} = \arg\min_{\lambda} L(\lambda \mid \Theta, D_L), \qquad L(\lambda \mid \Theta, D_L) = R(\lambda \mid \Theta, D_L) + C_L\,\Omega(\lambda) \qquad (41)$$

Here, R(λ|Θ, D_L) represents an arbitrary risk function. The risk function is a function for estimating the error in learning; a smaller value of the risk function indicates that learning is more successful. Like C_U in Equation (23), C_L is a hyperparameter to be adjusted manually. Ω(λ) denotes a regularization term with respect to λ. An L₁ norm regularization term or an L₂ norm regularization term is used, for example.

$$\text{L}_1\text{ norm regularization: } \Omega(\lambda) = \|\lambda\|_1, \qquad \text{L}_2\text{ norm regularization: } \Omega(\lambda) = \|\lambda\|_2^{2} \qquad (42)$$
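A minimal sketch of Equations (41) and (42), assuming NumPy vectors and a caller-supplied risk function; the helper names are hypothetical.

```python
import numpy as np

def omega_l1(lam):
    # L1 norm regularization term of Equation (42)
    return np.sum(np.abs(lam))

def omega_l2(lam):
    # Squared L2 norm regularization term of Equation (42)
    return float(np.dot(lam, lam))

def regularized_risk(lam, risk_fn, C_L, omega=omega_l2):
    # Empirical risk function with a regularization term, Equation (41)
    return risk_fn(lam) + C_L * omega(lam)
```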

Using Negative Log Likelihood as the Risk Function

An example of using the negative log likelihood as the risk function will be described. In minimization of the regularized negative log likelihood, the optimum parameters are obtained by the following equation.

$$R(\lambda \mid \Theta, D_L) = -\sum_{(x,y) \in D_L} \log \frac{\exp\!\left[d(x, y; \lambda, \Theta)\right]}{\sum_{y'} \exp\!\left[d(x, y'; \lambda, \Theta)\right]} \qquad (43)$$

This is a general optimization method when a probability model is used as the base-model. The gradient of Equation (43) can be expressed as follows.

$$\nabla R(\lambda \mid \Theta) = -E_{\tilde{P}_{D_L}(x,y)}[f] + E_{P(y \mid x; \lambda, \Theta)}[f] \qquad (44)$$

The gradient of the regularization term is expressed as follows.

$$\text{L}_1\text{ norm regularization: } \nabla\Omega(\lambda) = \sigma(\lambda), \qquad \text{L}_2\text{ norm regularization: } \nabla\Omega(\lambda) = 2\lambda \qquad (45)$$

When Equation (43) is substituted into Equation (41), the gradient of Equation (41) is expressed as follows.

$$\nabla L(\lambda \mid \Theta) = -E_{\tilde{P}_{D_L}(x,y)}[f] + E_{P(y \mid x; \lambda, \Theta)}[f] + C_L\,\nabla\Omega(\lambda) \qquad (46)$$

Here, $\tilde{P}_{D_L}(x,y)$ represents the empirical probability of x and y appearing in the supervised data D_L.

$E_{\tilde{P}_{D_L}(x,y)}[f]$ represents the empirical expectation of the feature vector f in the supervised data D_L; that is, it is a vector of the sums of the feature vectors appearing in the supervised data D_L actually used in learning. The optimum parameters in Equation (41) are obtained when the gradient of Equation (46) becomes 0. In the actual optimization, they can be obtained by a gradient-based numerical optimization method such as L-BFGS (see reference literature 2).
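As a toy illustration of Equation (46) (not from the specification), the following sketch enumerates a small candidate set Y(x) explicitly and uses a plain linear score in place of the discriminant function d; feat, candidates, and the L₂ choice for Ω are assumptions.

```python
import numpy as np

def grad_regularized_nll(lam, examples, feat, candidates, C_L):
    """Gradient of the regularized negative log likelihood, following
    Equation (46): -E_emp[f] + E_model[f] + C_L * grad(Omega).

    examples:   list of supervised (x, y) pairs from D_L
    candidates: function x -> list of output candidates in Y(x)
    feat:       function (x, y) -> feature vector f_{x,y}
    """
    grad = np.zeros_like(lam)
    for x, y in examples:
        cand = candidates(x)
        scores = np.array([lam @ feat(x, yc) for yc in cand])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                     # P(y|x; lambda, Theta)
        grad -= feat(x, y)                       # empirical expectation
        for p, yc in zip(probs, cand):           # model expectation
            grad += p * feat(x, yc)
    return grad + C_L * 2.0 * lam                # L2 term of Eq. (45)
```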

Estimation Based on Max-margin Model

The estimation of the base-model parameter set λ based on the max-margin principle will be described next. In this case, the risk function is expressed as follows.

$$R(\lambda \mid \Theta, D_L) = \sum_{(x,y) \in D_L} \max\!\left[0,\; E(y, \hat{y}) - d(x, y; \lambda, \Theta) + d(x, \hat{y}; \lambda, \Theta)\right], \qquad \hat{y} = \arg\max_{y' \in Y(x) \setminus y}\left[d(x, y'; \lambda, \Theta) + E(y, y')\right] \qquad (47)$$

When Equation (47) and the L₂ norm regularization term of Equation (45) are substituted into Equation (41), the gradient of Equation (41) is expressed as follows.

$$\nabla L(\lambda \mid \Theta) = \begin{cases} -f_{x,y} + f_{x,\hat{y}} - \displaystyle\sum_{k=1}^{K} \theta^{(k)} \cdot g_{x,y}^{(k)} + \displaystyle\sum_{k=1}^{K} \theta^{(k)} \cdot g_{x,\hat{y}}^{(k)} + 2 C_L \lambda & \text{if } R(\lambda \mid \Theta, D_L) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (48)$$

If R(λ|Θ, D_L) = 0, L(λ|Θ, D_L) cannot be differentiated, which means that L(λ|Θ, D_L) cannot be optimized by an ordinary gradient method. The optimization of L(λ|Θ, D_L) can instead be carried out with a subgradient method.
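For illustration only, a single subgradient update consistent with Equation (48) could be sketched as follows; margin_grad is a hypothetical callback returning the per-example subgradient (zero when the hinge is inactive), and the step-size schedule is an assumption, not part of the specification.

```python
import numpy as np

def subgradient_step(lam, examples, margin_grad, eta):
    """One pass of a simple subgradient method for the max-margin risk.

    margin_grad(lam, x, y) returns the subgradient of the hinge term in
    Equation (48) for one example. eta is a step size; a decaying
    schedule such as eta_t = eta0 / sqrt(t) is a typical choice."""
    g = np.zeros_like(lam)
    for x, y in examples:
        g += margin_grad(lam, x, y)
    return lam - eta * g
```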

Configuration of the Base-model Parameter Estimating Unit 160 and Processing Flow

The base-model parameter estimating unit 160 will be described with reference to FIGS. 19 and 20. To estimate the base-model parameter set λ in the conditional random field (refer to reference literature 1), the base-model parameter estimating unit 160 includes an empirical risk function calculator 161, a regularization term calculator 163, an empirical-risk-function-with-regularization-term calculator 165, a gradient calculator 167, a third convergence determination unit 169, and a parameter updating unit 171, as shown in FIG. 19.

The empirical risk function calculator 161 receives the supervised data D_L, the set Θ of auxiliary model parameter sets, and the base-model parameter set λ, and calculates the empirical risk function R(λ|Θ, D_L) of Equation (43) or (47) (s161). The empirical risk function calculator 161 outputs R(λ|Θ, D_L) to the empirical-risk-function-with-regularization-term calculator 165.

The regularization term calculator 163 receives the base-model parameter set λ and calculates the regularization term Ω(λ) of Equation (42) (s163). The regularization term calculator 163 outputs Ω(λ) to the empirical-risk-function-with-regularization-term calculator 165.

The empirical-risk-function-with-regularization-term calculator 165 receives the empirical risk function R(λ|Θ, D_L) and the regularization term Ω(λ), substitutes them into Equation (41), and calculates the empirical risk function L(λ|Θ, D_L) with a regularization term (s165). The empirical-risk-function-with-regularization-term calculator 165 outputs the empirical risk function L(λ|Θ, D_L) with the regularization term to the gradient calculator 167.

To optimize the empirical risk function L(λ|Θ, D_L) with the regularization term, a gradient-based numerical optimization method such as L-BFGS can be used. L-BFGS is described in reference literature 2, and a description thereof will be omitted here.

The gradient calculator 167 calculates the gradient ∇L(λ|Θ, D_L) of Equation (46) or (48) (s167).

The third convergence determination unit 169 determines whether the gradient ∇L(λ|Θ, D_L) of Equation (46) or (48) has converged (s169). When it is determined that the value of the gradient ∇L(λ|Θ, D_L) has converged, the third convergence determination unit 169 outputs the base-model parameter set λ* at that time to the first convergence determination unit 180.

If it is not determined that the value of the gradient ∇L(λ|Θ, D_L) has converged, the parameter updating unit 171 updates the base-model parameter set λ (s171).

First Convergence Determination Unit 180

The first convergence determination unit 180 receives the base-model parameter set λ and the set Θ of auxiliary model parameter sets and determines whether the values have converged (s180). The convergence determination unit in what is claimed below corresponds to the first convergence determination unit 180.

The first convergence determination unit 180 makes the determination by using an increment of a parameter, for example. If the value of |λ^((t))−λ^((t+1))| + |Θ^((t))−Θ^((t+1))| falls to a threshold or below, the first convergence determination unit 180 determines that the values have converged. The first convergence determination unit 180 may also determine that convergence has been reached when a repetition count t reaches a predetermined repetition count T (t=T).
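A minimal sketch of this determination, assuming the increments are measured with sums of absolute values (the specification does not fix the norm), might look as follows; all names are hypothetical.

```python
import numpy as np

def has_converged(lam_t, lam_t1, theta_t, theta_t1, eps, t, T):
    """Convergence test of the first convergence determination unit:
    the summed parameter increment falls to the threshold eps or below,
    or the repetition count t reaches the predetermined limit T."""
    increment = (np.sum(np.abs(lam_t - lam_t1))
                 + np.sum(np.abs(theta_t - theta_t1)))
    return increment <= eps or t >= T
```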

If it is determined that convergence has not been reached, the first convergence determination unit 180 outputs to the auxiliary model parameter estimating unit 140 a control signal c to repeat the processing of estimating the set Θ of auxiliary model parameter sets and the base-model parameter set λ. The first convergence determination unit 180 increments the parameter t, indicating the repetition count, by 1 (t←t+1). The preceding base-model P(x, y; λ^(t−1), Θ^(t−1)) in the repeated processing may be output as the reference function r̃.

When it is determined that convergence has been reached, the first convergence determination unit 180 outputs the set Θ of auxiliary model parameter sets and the base-model parameter set λ to the parameter integrating unit 190.

Parameter Integrating Unit 190

The parameter integrating unit 190 integrates the converged base-model parameter set λ and the converged set Θ of auxiliary model parameter sets (s190).

If the j-th feature of the base-model matches the p-th feature of the k-th auxiliary model, the parameter integrating unit 190 obtains the i-th element u_i of the integrated parameter set u as given by the equation below. FIG. 21 shows data examples of the parameter set u.

$$u_i = w_j + v^{(k)}\,\theta_p^{(k)} \qquad (51)$$

If only the base-model has a feature corresponding to the i-th element u_i, the parameter integrating unit 190 obtains the element u_i as given by the equation below.

$$u_i = w_j \qquad (52)$$

If only the auxiliary model has a feature corresponding to the i-th element u_i, the element u_i is obtained as given by the equation below.

$$u_i = v^{(k)}\,\theta_p^{(k)} \qquad (53)$$

The structured prediction model can be expressed by the equation given below, instead of Equation (1).

$$\hat{y} = \arg\max_{y \in Y(x)} d(x, y; u) \qquad (1)''$$

If the number of elements of the parameter set u is I when the base-model parameter set λ and the set Θ of auxiliary model parameter sets are integrated, the parameter set u is expressed by u = {u₁, u₂, . . . , u_I}.

The parameter integrating unit 190 outputs the integrated parameter set u, or a structured prediction model expressed by using the parameter set u, to the structured prediction model memory 6. In this embodiment, many of the elements of θ^((k)) are zero (those parameters are not active). In this state, the corresponding u_i obtained by Equation (53) is also a zero parameter.
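As an illustration of the integration rules, the following is a minimal sketch, not the specification's implementation, of Equations (51) to (53) with inactive (zero) entries pruned; the index maps base_feat and aux_feat and all variable names are hypothetical.

```python
def integrate_parameters(w, v, theta, base_feat, aux_feat):
    """Merge base-model and auxiliary-model parameters into u following
    Equations (51)-(53), dropping inactive (zero) entries.

    base_feat: dict feature -> index j into w
    aux_feat:  dict feature -> (k, p) indices into theta[k][p]
    """
    u = {}
    for feature in set(base_feat) | set(aux_feat):
        val = 0.0
        if feature in base_feat:           # Eq. (52) contribution
            val += w[base_feat[feature]]
        if feature in aux_feat:            # Eq. (53) contribution
            k, p = aux_feat[feature]
            val += v[k] * theta[k][p]      # both present gives Eq. (51)
        if val != 0.0:                     # inactive entries not stored
            u[feature] = val
    return u
```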

Effects

With this configuration, an inactive parameter (in other words, a zero parameter) and the feature corresponding to that parameter are deleted from the learned structured prediction model, so the memory space required to store the learned structured prediction model can be reduced. The structured prediction model created in accordance with the supervised data and unsupervised data maintains high prediction performance, while the memory space required to store it is reduced. The experimental results will be described later. Since the cost of generating the supervised data D_L is high, not many elements are obtained from the supervised data D_L as elements of the base-model parameter set. On the other hand, a great number of elements can be obtained easily from the unsupervised data D_U as elements of the set Θ of auxiliary model parameter sets. By driving most of the elements of the set Θ to zero, the required memory space can be saved.

Hardware Configuration

FIG. 22 is a block diagram showing the hardware configuration of the structured prediction model learning apparatus 100 in this embodiment. As shown in FIG. 22, the structured prediction model learning apparatus 100 includes a central processing unit (CPU) 11, an input unit 12, an output unit 13, auxiliary storage 14, a read-only memory (ROM) 15, a random access memory (RAM) 16, and a bus 17.

The CPU 11 includes a controller 11 a, an operation unit 11 b, and registers 11 c and executes a variety of operations in accordance with programs read into the registers 11 c. The input unit 12 is an input interface, a keyboard, a mouse, or the like, by which data is input, and the output unit 13 is an output interface, a display unit, a printer, or the like, by which data is output. The auxiliary storage 14 is a hard disk drive, a semiconductor memory, or the like and stores various data and programs for operating a computer as the structured prediction model learning apparatus 100. The programs and data are expanded in the RAM 16 and used by the CPU 11 or the like. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage 14, the ROM 15, and the RAM 16 to allow communication among them. Examples of hardware of this type include a personal computer, a server apparatus, and a workstation.

Program Configuration

The auxiliary storage 14 stores programs to execute all types of processing in the structured prediction model learning apparatus 100 in this embodiment, as described above. Each program constituting a structured prediction program may be written as a single program sequence, and some of the programs may be stored as separate modules in a library.

Cooperation Between Hardware and Programs

The CPU 11 loads and expands the programs and data read from the auxiliary storage 14 into the RAM 16, in accordance with the read OS program. The addresses of the locations in the RAM 16 where the programs and data are written are stored in the registers 11 c of the CPU 11. The controller 11 a of the CPU 11 reads the addresses stored in the registers 11 c successively, reads the programs and data from the corresponding locations in the RAM 16, has the operation unit 11 b execute the operations indicated by the programs, and stores the results of the operations in the registers 11 c.

FIG. 5 is a block diagram showing an example functional configuration of the structured prediction model learning apparatus 100 implemented by executing the programs read into the CPU 11.

The memory 103 is any of the auxiliary storage 14, the RAM 16, the registers 11 c, and other types of buffers, cache memories, and the like, or is a storage area using some of them. The output candidate graph generator 110, the feature vector generator 120, the parameter generator 130, the auxiliary model parameter estimating unit 140, the base-model parameter estimating unit 160, the first convergence determination unit 180, and the parameter integrating unit 190 are implemented by executing the structured prediction program on the CPU 11.

Results of Experiment

In FIG. 23, the chain line represents the accuracy rate of the structured prediction system using a structured prediction model learned from the supervised data alone, and the solid line represents the accuracy rate of the structured prediction system using a structured prediction model learned by the structured prediction model learning apparatus 100 using auxiliary models of type 3. The accuracy rate of the structured prediction system using the structured prediction model learned by the structured prediction model learning apparatus 100 is higher, irrespective of the number of elements in the parameter set. At an accuracy rate below 92.5%, the number of elements in the parameter set u used in the structured prediction model learned by the structured prediction model learning apparatus 100 is about one tenth of the number of elements in the parameter set used in the structured prediction model learned from the supervised data alone.

Modification

The structured prediction model learning apparatus 100, the learning support information memory 4, the structured prediction model memory 6, and the structured prediction system 7 may be combined and implemented on a single computer.

The structured prediction model learning apparatus 100 can be used for problems other than sequence structured prediction problems, if the set of output candidate definition data and the feature extraction template are changed to ones suitable for the problem. FIG. 24 shows output candidate graphs for linking structure prediction problems.

The feature vector generator 120 may combine a set of features extracted from the supervised-data output candidate graph Gr_DL and a set of features extracted from the unsupervised-data output candidate graph Gr_DU. The feature vector generator 120 deletes identical features from the combined feature set and generates a common feature set. The base-model feature vector f_{x,y} is a D-dimensional vector whose elements are in one-to-one correspondence with the elements of the common feature set. The auxiliary model feature vector g^((k))_{x,y} is a D_k-dimensional vector whose elements are in one-to-one correspondence with the elements of a subset of the common feature set; the subsets are obtained by dividing the common feature set into K parts. The dimensions of the vectors satisfy D = D₁ + D₂ + . . . + D_K. In this case, the parameter integrating unit 190 uses just Equation (51).
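For illustration, a common feature set and its split into K parts could be sketched as follows; the round-robin split and all names are assumptions, since the specification does not fix how the division into K parts is performed.

```python
def build_common_features(sup_feats, unsup_feats, K):
    """Combine features from the supervised and unsupervised output
    candidate graphs, de-duplicate them into a common feature set, and
    split it into K parts so that D = D_1 + ... + D_K."""
    common = sorted(set(sup_feats) | set(unsup_feats))  # de-duplicated
    # Simple round-robin split into K subsets (one possible scheme).
    return [common[k::K] for k in range(K)]
```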

The auxiliary model parameter estimating unit 140 estimates a set of auxiliary model parameter sets which minimizes the empirical generalized relative entropy U(Θ|D_U) with the regularization term, by using the gradient of the empirical generalized relative entropy U(Θ|D_U) with the regularization term expressed by Equation (24), (26), (28), (30), (32), or (34). This estimation may be made by another method that does not use the gradient. The base-model parameter estimating unit 160 estimates a base-model parameter set that minimizes the empirical risk function L(λ|Θ, D_L) with the regularization term, by using the gradient of the empirical risk function L(λ|Θ, D_L) with the regularization term expressed by Equation (41). This estimation may also be made by another method that does not use the gradient.

What is claimed is:
 1. A structured prediction model learning apparatus, having a central processing unit, for learning a structured prediction model used to predict an output structure y corresponding to an input structure x, by using supervised data D_L and unsupervised data D_U, the structured prediction model learning apparatus comprising: an output candidate graph generator implemented by the central processing unit to generate a supervised data output candidate graph for the supervised data and an unsupervised data output candidate graph for the unsupervised data, by using a set of definition data for generating output candidates identified by a structured prediction problem; a feature vector generator extracting features from the supervised data output candidate graph and the unsupervised data output candidate graph by using a feature extraction template, generating a D-dimensional base-model feature vector f_{x,y} corresponding to a set of the features extracted from the supervised data output candidate graph, dividing a set of the features extracted from the unsupervised data output candidate graph into K subsets, and generating a D_k-dimensional auxiliary model feature vector g^((k))_{x,y} corresponding to features included in a subset k of the K subsets, where K is a natural number and k∈{1, 2, . . . , K}; a parameter generator generating a base-model parameter set λ which includes a first parameter set w formed of D first parameters in one-to-one correspondence with D elements of the base-model feature vector f_{x,y}, generating an auxiliary model parameter set θ^((k)) formed of D_k auxiliary model parameters in one-to-one correspondence with D_k elements of the auxiliary model feature vector g^((k))_{x,y}, and generating a set Θ={θ⁽¹⁾, θ⁽²⁾, . . . , θ^((K))} of auxiliary model parameter sets, formed of K auxiliary model parameter sets θ^((k)); an auxiliary model parameter estimating unit estimating the set Θ of auxiliary model parameter sets which minimizes the Bregman divergence having a regularization term obtained from the auxiliary model parameter set θ^((k)), between each auxiliary model q_k and a reference function r̃(x, y) which is a nonnegative function and indicates the degree of pseudo accuracy of the output structure y corresponding to the input structure x, by using the regularization term and the unsupervised data D_U, where the auxiliary model q_k is obtained by defining the auxiliary model parameter set θ^((k)) with a log-linear model; and a base-model parameter estimating unit estimating a base-model parameter set λ which minimizes an empirical risk function defined beforehand, by using the supervised data D_L and the set Θ of auxiliary model parameter sets, where the base-model parameter set λ includes a second parameter set v={v₁, v₂, . . . , v_K} formed of K second parameters in one-to-one correspondence with the K auxiliary models; wherein the auxiliary model parameter estimating unit uses the auxiliary model parameter set θ^((k)) to obtain an L₁ norm regularization term |θ^((k))|₁, obtains the Bregman divergence having the regularization term as the following empirical generalized relative entropy having a regularization term

$$U(\Theta \mid D_U) = C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{G}_{D_U}\!\left(\tilde{r}\,\|\,q_k\right)$$

where C_U is a hyperparameter and Ĝ_{D_U}(r̃‖q_k) is a generalized relative entropy obtained by using the unsupervised data D_U, and estimates the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy having the regularization term.
 2. The structured prediction model learning apparatus according to claim 1, further comprising: a convergence determination unit determining whether the values of the base-model parameter set λ and the set Θ of auxiliary model parameter sets have converged; and a parameter integrating unit integrating the converged base-model parameter set λ and the converged set Θ of auxiliary model parameter sets; wherein, in a case where the convergence determination unit determines that the values of the base-model parameter set λ and the set Θ of auxiliary model parameter sets have not converged, the auxiliary model parameter estimating unit repeatedly estimates the set Θ of auxiliary model parameter sets, and the base-model parameter estimating unit repeatedly estimates the base-model parameter set λ; and the reference function r̃ is a base-model P used immediately before the current repetition of model parameter set estimation.
 3. The structured prediction model learning apparatus according to claim 1, wherein the base-model parameter estimating unit uses a regularization term Ω(λ) obtained from the base-model parameter set λ to obtain the empirical risk function as the following empirical risk function having a regularization term

L(λ|Θ, D_L) = R(λ|Θ, D_L) + C_L Ω(λ)

where C_L is a hyperparameter; in a case where a negative log likelihood is used as the empirical risk function, the following is used

$$R(\lambda \mid \Theta, D_L) = -\sum_{(x,y) \in D_L} \log \frac{\exp\!\left[d(x, y; \lambda, \Theta)\right]}{\sum_{y'} \exp\!\left[d(x, y'; \lambda, \Theta)\right]}$$

where d(x, y; λ, Θ) represents a discriminant function that returns a score indicating a likelihood of obtaining the output structure y with respect to the input structure x; and in a case where a base-model parameter set λ which minimizes L(λ|Θ, D_L) is estimated based on margin maximization in a linear identification model, the following are used

$$R(\lambda \mid \Theta, D_L) = \sum_{(x,y) \in D_L} \max\!\left[0,\; E(y, \hat{y}) - d(x, y; \lambda, \Theta) + d(x, \hat{y}; \lambda, \Theta)\right]$$

$$\hat{y} = \arg\max_{y' \in Y(x) \setminus y}\left[d(x, y'; \lambda, \Theta) + E(y, y')\right]$$

where E(y, ŷ) is a function expressing the degree of error for ŷ obtained by comparing a correct output y with a certain output ŷ, and Y(x)\y represents a difference set obtained by subtracting the output structure y corresponding to the input structure x from the set Y(x) of all possible outputs with respect to x.
 4. The structured prediction model learning apparatus according to claim 1, wherein the auxiliary model parameter estimating unit obtains the empirical generalized relative entropy having the regularization term as one of the following, where the conditional probability of outputting the output structure y in a case where the input structure x is given is q¹_k(y|x; θ^((k))) and the odds of q¹_k are q′¹_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,y)\,\middle\|\,q_k^{1}(y \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{y} \log\!\left[b(y) + \exp\!\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,y)\,\middle\|\,q_k^{\prime 1}(y \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{y} q_k^{\prime 1}(y \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

where b(y) is a function that returns a value greater than or equal to 1 and const(θ^((k))) is a collective value of constant terms with respect to θ^((k)); and the auxiliary model parameter estimating unit estimates the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy.
 5. The structured prediction model learning apparatus according to claim 1, wherein the auxiliary model parameter estimating unit obtains the empirical generalized relative entropy having the regularization term as one of the following, where the conditional probability of outputting a local structure z in the output structure y in a case where the input structure x is given is q²_k(z|x; θ^((k))) and the odds of q²_k are q′²_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 2}(z \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

where b(z) represents the number of local structures that are rival candidates of the local structure z and const(θ^((k))) is a collective value of constant terms with respect to θ^((k)); and the auxiliary model parameter estimating unit estimates the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy.
 6. The structured prediction model learning apparatus according to claim 1, wherein the auxiliary model parameter estimating unit obtains the empirical generalized relative entropy having the regularization term as one of the following, where the conditional probability of outputting a local structure z having a feature n in the output structure y in a case where the input structure x is given is q³_k(z|x, n; θ^((k))) and the odds of q³_k are q′³_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 3}(z \mid x, n;\theta_n^{(k)}) + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned}$$

where n=1, 2, . . . , D_k, θ^((k))=(θ^((k))₁, θ^((k))₂, . . . , θ^((k))_{D_k}), g^((k))_{x,z}=(g^((k))_{x,z,1}, g^((k))_{x,z,2}, . . . , g^((k))_{x,z,D_k}), b(z) represents the number of local structures that are rival candidates of the local structure z, and const(θ^((k))_n) is a collective value of constant terms with respect to θ^((k))_n; and the auxiliary model parameter estimating unit estimates the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy.
 7. A structured prediction model learning method for learning a structured prediction model used to predict an output structure y corresponding to an input structure x, by using supervised data D_L and unsupervised data D_U, the structured prediction model learning method comprising: an output candidate graph generating step of generating a supervised data output candidate graph for the supervised data and an unsupervised data output candidate graph for the unsupervised data, by using a set of definition data for generating output candidates identified by a structured prediction problem; a feature vector generating step of extracting features from the supervised data output candidate graph and the unsupervised data output candidate graph by using a feature extraction template, generating a D-dimensional base-model feature vector f_{x,y} corresponding to a set of the features extracted from the supervised data output candidate graph, dividing a set of the features extracted from the unsupervised data output candidate graph into K subsets, and generating a D_k-dimensional auxiliary model feature vector g^((k))_{x,y} corresponding to features included in a subset k of the K subsets, where K is a natural number and k∈{1, 2, . . . , K}; a parameter generating step of generating a base-model parameter set λ which includes a first parameter set w formed of D first parameters in one-to-one correspondence with D elements of the base-model feature vector f_{x,y}, generating an auxiliary model parameter set θ^((k)) formed of D_k auxiliary model parameters in one-to-one correspondence with D_k elements of the auxiliary model feature vector g^((k))_{x,y}, and generating a set Θ={θ⁽¹⁾, θ⁽²⁾, . . . , θ^((K))} of auxiliary model parameter sets, formed of K auxiliary model parameter sets θ^((k)); an auxiliary model parameter estimating step of estimating the set Θ of auxiliary model parameter sets which minimizes the Bregman divergence having a regularization term obtained from the auxiliary model parameter set θ^((k)), between each auxiliary model q_k and a reference function r̃(x, y) which is a nonnegative function and indicates the degree of pseudo accuracy of the output structure y corresponding to the input structure x, by using the regularization term and the unsupervised data D_U, where the auxiliary model q_k is obtained by defining the auxiliary model parameter set θ^((k)) with a log-linear model; and a base-model parameter estimating step of estimating a base-model parameter set λ which minimizes an empirical risk function defined beforehand, by using the supervised data D_L and the set Θ of auxiliary model parameter sets, where the base-model parameter set λ includes a second parameter set v={v₁, v₂, . . . , v_K} formed of K second parameters in one-to-one correspondence with the K auxiliary models; wherein, in the auxiliary model parameter estimating step, the auxiliary model parameter set θ^((k)) is used to obtain an L₁ norm regularization term |θ^((k))|₁, the Bregman divergence having the regularization term is obtained as the following empirical generalized relative entropy having a regularization term

$$U(\Theta \mid D_U) = C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{G}_{D_U}\!\left(\tilde{r}\,\|\,q_k\right)$$

where C_U is a hyperparameter and Ĝ_{D_U}(r̃‖q_k) is a generalized relative entropy obtained by using the unsupervised data D_U, and the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy having the regularization term is estimated.
 8. The structured prediction model learning method according to claim 7, further comprising: a convergence determination step of determining whether the values of the base-model parameter set λ and the set Θ of auxiliary model parameter sets have converged; and a parameter integrating step of integrating the converged base-model parameter set λ and the converged set Θ of auxiliary model parameter sets; wherein, in a case where it is determined in the convergence determination step that the values of the base-model parameter set λ and the set Θ of auxiliary model parameter sets have not converged, the set Θ of auxiliary model parameter sets is repeatedly estimated in the auxiliary model parameter estimating step and the base-model parameter set λ is repeatedly estimated in the base-model parameter estimating step; and the reference function r̃ is a base-model P used immediately before the current repetition of model parameter set estimation.
 9. The structured prediction model learning method according to claim 7, wherein, in the base-model parameter estimating step, a regularization term Ω(λ) obtained from the base-model parameter set λ is used to obtain the empirical risk function as the following empirical risk function having a regularization term

L(λ|Θ, D_L) = R(λ|Θ, D_L) + C_L Ω(λ)

where C_L is a hyperparameter; in a case where a negative log likelihood is used as the empirical risk function, the following is used

$$R(\lambda \mid \Theta, D_L) = -\sum_{(x,y) \in D_L} \log \frac{\exp\!\left[d(x, y; \lambda, \Theta)\right]}{\sum_{y'} \exp\!\left[d(x, y'; \lambda, \Theta)\right]}$$

where d(x, y; λ, Θ) represents a discriminant function that returns a score indicating a likelihood of obtaining the output structure y with respect to the input structure x; and in a case where a base-model parameter set λ which minimizes L(λ|Θ, D_L) is estimated based on margin maximization in a linear identification model, the following are used

$$R(\lambda \mid \Theta, D_L) = \sum_{(x,y) \in D_L} \max\!\left[0,\; E(y, \hat{y}) - d(x, y; \lambda, \Theta) + d(x, \hat{y}; \lambda, \Theta)\right]$$

$$\hat{y} = \arg\max_{y' \in Y(x) \setminus y}\left[d(x, y'; \lambda, \Theta) + E(y, y')\right]$$

where E(y, ŷ) is a function expressing the degree of error for ŷ obtained by comparing a correct output y with a certain output ŷ, and Y(x)\y represents a difference set obtained by subtracting the output structure y corresponding to the input structure x from the set Y(x) of all possible outputs with respect to x.
 10. The structured prediction model learning method according to claim 7, wherein, in the auxiliary model parameter estimating step, the empirical generalized relative entropy having the regularization term is obtained as one of the following, where the conditional probability of outputting the output structure y in a case where the input structure x is given is q¹_k(y|x; θ^((k))) and the odds of q¹_k are q′¹_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,y)\,\middle\|\,q_k^{1}(y \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{y} \log\!\left[b(y) + \exp\!\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,y)\,\middle\|\,q_k^{\prime 1}(y \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{y} \tilde{r}(x,y)\left[\theta^{(k)} \cdot g_{x,y}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{y} q_k^{\prime 1}(y \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

where b(y) is a function that returns a value greater than or equal to 1 and const(θ^((k))) is a collective value of constant terms with respect to θ^((k)); and the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy is estimated.
 11. The structured prediction model learning method according to claim 7, wherein, in the auxiliary model parameter estimating step, the empirical generalized relative entropy having the regularization term is obtained as one of the following, where the conditional probability of outputting a local structure z in the output structure y in a case where the input structure x is given is q²_k(z|x; θ^((k))) and the odds of q²_k are q′²_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 2}(z \mid x;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta^{(k)} \cdot g_{x,z}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 2}(z \mid x;\theta^{(k)}) + \mathrm{const}\!\left(\theta^{(k)}\right) \end{aligned}$$

where b(z) represents the number of local structures that are rival candidates of the local structure z and const(θ^((k))) is a collective value of constant terms with respect to θ^((k)); and the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy is estimated.
 12. The structured prediction model learning method according to claim 7, wherein, in the auxiliary model parameter estimating step, the empirical generalized relative entropy having the regularization term is obtained as one of the following, where the conditional probability of outputting a local structure z having a feature n in the output structure y in a case where the input structure x is given is q³_k(z|x, n; θ^((k))) and the odds of q³_k are q′³_k,

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \log\!\left[b(z) + \exp\!\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right]\right] + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned}$$

and

$$\begin{aligned} U(\Theta \mid D_U) &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 + \sum_{k} \hat{K}_{D_U}\!\left(\tilde{r}(x,z)\,\middle\|\,q_k^{\prime 3}(z \mid x, n;\theta^{(k)})\right) \\ &= C_U \sum_{k} \left|\theta^{(k)}\right|_1 - \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} \tilde{r}(x,z)\left[\theta_n^{(k)} \cdot g_{x,z,n}^{(k)}\right] \\ &\quad + \sum_{k}\sum_{x \in D_U}\sum_{z \in Z(x,Y(x))} q_k^{\prime 3}(z \mid x, n;\theta_n^{(k)}) + \mathrm{const}\!\left(\theta_n^{(k)}\right) \end{aligned}$$

where n=1, 2, . . . , D_k, θ^((k))=(θ^((k))₁, θ^((k))₂, . . . , θ^((k))_{D_k}), g^((k))_{x,z}=(g^((k))_{x,z,1}, g^((k))_{x,z,2}, . . . , g^((k))_{x,z,D_k}), b(z) represents the number of local structures that are rival candidates of the local structure z, and const(θ^((k))_n) is a collective value of constant terms with respect to θ^((k))_n; and the set Θ of auxiliary model parameter sets which minimizes the empirical generalized relative entropy is estimated.
 13. A non-transitory computer-readable recording medium that records a program that makes a computer function as the structured prediction model learning apparatus according to claim 1.